Wes Anderson’s Script Analysis
library(dplyr)
library(tidyverse)
library(tidytext)
library(scrapex)
library(rvest)
library(httr)
library(tidyr)
library(stringr)
library(ggplot2)
library(xml2)
library(sf)
library(rnaturalearth)
library(scales)
library(reshape2)
library(wordcloud)
library(ggraph)
library(igraph)
library(widyr)
Wes Anderson
Wes Anderson is an American filmmaker known for his distinct visual style, eccentric characters, and idiosyncratic stories. The color palette of a film can reveal a lot about it. For this project I decided to combine text mining methods and color theory to understand Wes Anderson’s storytelling technique. I will analyze the scripts of three of his most popular films to understand Anderson’s stories through his use of words and color. The objective of this project is to make sense of Anderson’s character development and the core themes of each movie, and to explore the themes shared by Rushmore (1998), Fantastic Mr. Fox (2009), and Moonrise Kingdom (2012).
Thanks to library(wesanderson) we can access Anderson’s bright and
colorful world from R, but do his scripts follow this
style? Are his characters always cheerful and lively? Is the language he
uses always positive? Those who have seen his films know that behind
this facade there is something else, but what do the words of his
scripts tell us by themselves?
Following Karthik’s package statement, let’s create the most indie final assignment for a course.
Scripts
The first step is to install the wesanderson package and
save the movie scripts into variables. The Internet Movie Script Database (IMSDb) is
a website that offers free scripts for movies, television shows, and
other forms of visual media. This database has a collection of scripts
for several Wes Anderson films, including “Rushmore”, “Fantastic
Mr. Fox”, and “Moonrise Kingdom”.
Installation:
#install.packages("wesanderson")
#or
#devtools::install_github("karthik/wesanderson")
library(wesanderson)
All the palettes available:
names(wes_palettes)
## [1] "BottleRocket1" "BottleRocket2" "Rushmore1" "Rushmore"
## [5] "Royal1" "Royal2" "Zissou1" "Darjeeling1"
## [9] "Darjeeling2" "Chevalier1" "FantasticFox1" "Moonrise1"
## [13] "Moonrise2" "Moonrise3" "Cavalcanti1" "GrandBudapest1"
## [17] "GrandBudapest2" "IsleofDogs1" "IsleofDogs2" "FrenchDispatch"
In order to convert the scripts into variables, we have to read the
webpage using read_html() and extract the text from it. The
function xml_child() returns the first child node of the parsed
webpage. The code and procedure are the same for all three scripts; what we
need is the character vector with the text of each script.
After reading and inspecting the IMSDb webpage, I see that the information we want to extract is in the <td> tag with the class scrtext.
- The function xml_find_all() allows us to extract the specific nodes containing the screenplay text.
- The function xml_text() returns a character vector.
- The function str_split() takes the character vector from the previous step and returns a list split by “\r\n”, which represents a line break and indicates the start of a new line.
Character names in a script are frequently written in all capital
letters when they are first introduced (e.g., MAX, CAPTAIN
SHARP, LAZY-EYE, FOX). This is a convention used to help the
reader recognize new characters and differentiate them from speech and
other components of the script. However, for text analysis, we do not
want these names repeating themselves every time a character speaks,
especially since word frequency and the context of words are so important
to understanding the text. I removed the all-capital names with the
function str_replace_all() to avoid the overrepresentation
of the names in the analysis.
The only thing left to do is to remove all the extra space between
lines. I used the grepl() function to search for the pattern “^$”,
which matches an empty line (and “^\\s*$”, which also matches lines that
contain only whitespace), and kept the lines that do not match the
pattern by placing a ! character before grepl().
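The cleaning steps described above can be tried on a small string before touching the real pages. This is a minimal sketch, assuming the stringr package is installed; the text in toy_script is invented for illustration and does not come from IMSDb.

```r
library(stringr)

# Toy stand-in for the text returned by xml_text() (invented, not from IMSDb)
toy_script <- "MAX\r\nI wrote a play.\r\n\r\nMR. BLUME\r\nIs it any good?\r\n   \r\n"

# Split the single string into lines at each "\r\n"
script_lines <- str_split(toy_script, "\r\n")[[1]]

# Blank out lines made up only of capitals, whitespace, and punctuation
script_lines <- str_replace_all(script_lines,
                                pattern = "^[A-Z[:space:][:punct:]]+$",
                                replacement = "")

# Keep only the lines that are not empty or whitespace-only
script_lines <- script_lines[!grepl("^\\s*$", script_lines)]
script_lines
```

One consequence of this pattern is that a short line of dialogue written entirely in capitals (for example “NO.”) would be removed together with the character names, so the result is an approximation of the spoken and descriptive text.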
Rushmore Academy
raw <- read_html("https://imsdb.com/scripts/Rushmore.html") %>% xml_child()
raw
## {html_node}
## <head>
## [1] <meta name="viewport" content="width=device-width, initial-scale=1">\n
## [2] <meta name="HandheldFriendly" content="true">\n
## [3] <meta http-equiv="content-type" content="text/html; charset=UTF-8">\n
## [4] <meta http-equiv="Content-Language" content="EN">\n
## [5] <meta name="objecttype" content="Document">\n
## [6] <meta name="ROBOTS" content="INDEX, FOLLOW">\n
## [7] <meta name="Subject" content="Movie scripts, Film scripts">\n
## [8] <meta name="rating" content="General">\n
## [9] <meta name="distribution" content="Global">\n
## [10] <meta name="revisit-after" content="2 days">\n
## [11] <link href="/style.css" rel="stylesheet" type="text/css">\n
## [12] <script type="text/javascript">\r\n var _gaq = _gaq || [];\r\n _gaq.pu ...
rushmore <- raw %>%
xml_find_all("//td[@class='scrtext']") %>%
xml_text() %>%
str_split("\r\n") %>%
.[[1]]
rushmore <- rushmore %>%
str_replace_all(pattern = "^[A-Z[:space:][:punct:]]+$", replacement = "")
rushmore <- rushmore[!grepl("^$", rushmore)] #remove empty lines
rushmore <- as.data.frame(rushmore[!grepl("^\\s*$", rushmore)]) #drop whitespace-only lines and create data frame
colnames(rushmore)[1] <- "text" #change the name of the column to text
rushmore <- rushmore %>%
#include a second column to know which movie the text comes from
mutate(movie="Rushmore") %>%
# Remove the first 3 rows and the last 2, which include information about the script but are not text from the script
slice(4:(nrow(.) - 2))
tibble(rushmore)
## # A tibble: 1,819 × 2
## text movie
## <chr> <chr>
## 1 A private day school. Twenty 10th grade boys are sitting in desks in g… Rush…
## 2 The teacher, MR. ADAMS, is at the front of the room, finishing a compl… Rush…
## 3 Except when the value of the x coordinate is less than or equal to the… Rush…
## 4 A boy named ISAAC has raised his hand Rush…
## 5 What about that problem? Rush…
## 6 Isaac points to a startling and intricate arrangement of huge numbers … Rush…
## 7 Oh, I really just put that up there as a joke. That's probably the har… Rush…
## 8 How much extra credit is it worth? Rush…
## 9 Well, I've never seen anyone get it right before, including my mentor,… Rush…
## 10 (pause) Rush…
## # … with 1,809 more rows
wes_palette("Rushmore")
Moonrise Kingdom
raw <- read_html("https://imsdb.com/scripts/Moonrise-Kingdom.html") %>% xml_child()
raw
## {html_node}
## <head>
## [1] <meta name="viewport" content="width=device-width, initial-scale=1">\n
## [2] <meta name="HandheldFriendly" content="true">\n
## [3] <meta http-equiv="content-type" content="text/html; charset=UTF-8">\n
## [4] <meta http-equiv="Content-Language" content="EN">\n
## [5] <meta name="objecttype" content="Document">\n
## [6] <meta name="ROBOTS" content="INDEX, FOLLOW">\n
## [7] <meta name="Subject" content="Movie scripts, Film scripts">\n
## [8] <meta name="rating" content="General">\n
## [9] <meta name="distribution" content="Global">\n
## [10] <meta name="revisit-after" content="2 days">\n
## [11] <link href="/style.css" rel="stylesheet" type="text/css">\n
## [12] <script type="text/javascript">\r\n var _gaq = _gaq || [];\r\n _gaq.pu ...
# Search the document for all <td> tags with the class scrtext
moonrise <- raw %>%
xml_find_all("//td[@class='scrtext']") %>%
xml_text() %>%
str_split("\r\n") %>%
.[[1]]
moonrise <- moonrise %>%
str_replace_all(pattern = "^[A-Z[:space:][:punct:]]+$", replacement = "")
moonrise <- moonrise[!grepl("^$", moonrise)] #remove extra spaces
moonrise <- as.data.frame(moonrise[!grepl("^\\s*$", moonrise)])
colnames(moonrise)[1] <- "text"
moonrise <- moonrise %>%
mutate(movie="Moonrise Kingdom") %>%
slice(4:(nrow(.) - 2))
tibble(moonrise)
## # A tibble: 3,149 × 2
## text movie
## <chr> <chr>
## 1 " A landing at the top of a crooked, wooden staircase. There … Moon…
## 2 " a threadbare, braided rug on the floor. There is a long, wi… Moon…
## 3 " corridor decorated with faded paintings of sailboats and" Moon…
## 4 " battleships. The wallpapers are sun-bleached and peeling at" Moon…
## 5 " the corners except for a few newly-hung strips which are" Moon…
## 6 " clean and bright. A small easel sits stored in the corner." Moon…
## 7 " Outside, a hard rain falls, drumming the roof and rattling" Moon…
## 8 " the gutters." Moon…
## 9 " A ten-year-old boy in pajamas comes up the steps carefully" Moon…
## 10 " eating a bowl of cereal as he walks. He is Lionel. Lionel" Moon…
## # … with 3,139 more rows
wes_palette("Moonrise3")
Fantastic Mr. Fox
raw <- read_html("https://imsdb.com/scripts/Fantastic-Mr-Fox.html") %>% xml_child()
raw
## {html_node}
## <head>
## [1] <meta name="viewport" content="width=device-width, initial-scale=1">\n
## [2] <meta name="HandheldFriendly" content="true">\n
## [3] <meta http-equiv="content-type" content="text/html; charset=UTF-8">\n
## [4] <meta http-equiv="Content-Language" content="EN">\n
## [5] <meta name="objecttype" content="Document">\n
## [6] <meta name="ROBOTS" content="INDEX, FOLLOW">\n
## [7] <meta name="Subject" content="Movie scripts, Film scripts">\n
## [8] <meta name="rating" content="General">\n
## [9] <meta name="distribution" content="Global">\n
## [10] <meta name="revisit-after" content="2 days">\n
## [11] <link href="/style.css" rel="stylesheet" type="text/css">\n
## [12] <script type="text/javascript">\r\n var _gaq = _gaq || [];\r\n _gaq.pu ...
# Search the document for all <td> tags with the class scrtext
fox <- raw %>%
xml_find_all("//td[@class='scrtext']") %>%
xml_text() %>%
str_split("\r\n") %>%
.[[1]]
fox <- fox %>%
str_replace_all(pattern = "^[A-Z[:space:][:punct:]]+$", replacement = "")
fox <- fox[!grepl("^$", fox)] #remove extra spaces
fox <- as.data.frame(fox[!grepl("^\\s*$", fox)])
colnames(fox)[1] <- "text"
fox <- fox %>%
mutate(movie="Fantastic Mr. Fox") %>%
slice(4:(nrow(.) - 2))
tibble(fox)
## # A tibble: 2,776 × 2
## text movie
## <chr> <chr>
## 1 " An apple tree stands alone at the top of a hill. A handsome" Fant…
## 2 " fox dressed in an Edwardian-style navy velvet suit leans" Fant…
## 3 " against it with his arms folded and his legs crossed, chewi… Fant…
## 4 " on a reed of wild grass. He holds an apple core in his paw." Fant…
## 5 " He spits out a seed. He looks off across a meadow that" Fant…
## 6 " descends into the valley below." Fant…
## 7 " A female fox strides briskly up the hill. Her coat is a" Fant…
## 8 " paler, especially beautiful shade of fox-red, and she wears" Fant…
## 9 " men's trousers and a dark tunic. Fox says as she approaches… Fant…
## 10 " What'd the doctor say?" Fant…
## # … with 2,766 more rows
wes_palette("FantasticFox1")
script <- rbind(moonrise,fox,rushmore) #combine the three scripts in one variable
unique(script$movie)
## [1] "Moonrise Kingdom" "Fantastic Mr. Fox" "Rushmore"
After obtaining the text from each script and taking a first look at the colors of each film, we can finally start with the text analysis.
Tokenization
As we saw in class, tokenization is the process of breaking down a piece of text into individual units called tokens, which are meaningful units of text.
wes_script <- script %>%
unnest_tokens(word, text)
Stop words have grammatical meaning but do not add to text mining
(e.g., “the”, “or”, “a”). Thus the next step is to filter them out by using
anti_join().
wes_script <- wes_script %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
wes_words <- wes_script %>%
count(movie, word, sort = TRUE)
wes_words
## movie word n
## 1 Fantastic Mr. Fox fox 488
## 2 Rushmore max 487
## 3 Moonrise Kingdom sam 285
## 4 Moonrise Kingdom suzy 238
## 5 Rushmore blume 215
## 6 Moonrise Kingdom scout 193
## [ reached 'max' / getOption("max.print") -- omitted 9100 rows ]
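The two steps above (tokenizing and removing stop words) can be checked on a single invented sentence. This is a small sketch, assuming the dplyr and tidytext packages are installed; the toy data frame and its sentence are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Toy one-line "script"; the sentence is invented for illustration
toy <- tibble(movie = "Toy", text = "The fox is in the tree.")

toy_tokens <- toy %>%
  # one lowercase word per row, punctuation stripped
  unnest_tokens(word, text) %>%
  # drop the words that also appear in the stop_words data set
  anti_join(stop_words, by = "word")

toy_tokens$word
```

Passing by = "word" explicitly also silences the “Joining with” message that appears in the output above.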
Counting word frequencies
We want to know how many times a word is mentioned in Anderson’s
movies, so we use the function count().
rushmore_words <- wes_script %>%
filter(movie =="Rushmore") %>%
count(movie, word, sort = TRUE)
rushmore_words
## movie word n
## 1 Rushmore max 487
## 2 Rushmore blume 215
## 3 Rushmore miss 128
## 4 Rushmore cross 125
## 5 Rushmore dirk 87
## 6 Rushmore fischer 67
## [ reached 'max' / getOption("max.print") -- omitted 2543 rows ]
In Rushmore we see that the two most frequent words are Max and Blume, both character names. The next two are a little confusing if you have not seen the movie, since they go together and refer to another really important character, Miss Rosemary Cross.
After the names of the characters, the most repeated words that begin to give us some clues about the overall tone of this film are dirk, pause, nods, smiles, silence, hands, starts, eyes, and school. Many of these words describe the actions or movements the actors have to follow. However, in this first reading we can perceive a contemplative, intimate, or introspective tone, with moments of silence, nods, and eye contact indicating that characters are reflecting on their thoughts and feelings. The word school is also a great indicator of where this story takes place.
wes_script %>%
filter(movie =="Rushmore") %>%
count(movie, word, sort = TRUE) %>%
#only words mentioned at least 40 times in the script
filter(n >= 40) %>%
#we reorder words by number of mentions
mutate(word = reorder(word, n)) %>%
#we create the plot with the word (x) and the number of mentions (y)
ggplot(aes(n, word, fill = word)) +
geom_col() +
labs(y = NULL) +
scale_fill_manual(values = wes_palette(21, name = "Rushmore", type = "continuous"), name = "") +
#reverse the order of the legend
guides(fill = guide_legend(reverse = TRUE))
moonrise_words <- wes_script %>%
filter(movie =="Moonrise Kingdom") %>%
count(movie,word, sort = TRUE)
moonrise_words
## movie word n
## 1 Moonrise Kingdom sam 285
## 2 Moonrise Kingdom suzy 238
## 3 Moonrise Kingdom scout 193
## 4 Moonrise Kingdom master 157
## 5 Moonrise Kingdom captain 146
## 6 Moonrise Kingdom ward 141
## [ reached 'max' / getOption("max.print") -- omitted 3399 rows ]
In the case of Moonrise Kingdom, we see that, once again, the two most common words are our main characters’ names: Sam and Suzy. The third word, scout, is a really good hint about who the main characters are. As in Rushmore, we also find some words that seem to point to something else but are actually part of character names: Scout Master Randy Ward, Captain Sharp, or Skotak. Nonetheless, these words also show the hierarchical ranks that organizations like the Scouts normally use. The division of power and the titles used by the Scouts come from the military, so words like troop, services, or commander fit the setting in which this story takes place.
wes_script %>%
filter(movie =="Moonrise Kingdom") %>%
count(movie, word, sort = TRUE) %>%
filter(n >= 40) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = word)) +
geom_col() +
labs(y = NULL) +
scale_fill_manual(values = wes_palette(21, name = "Moonrise3", type = "continuous"), name = "") +
guides(fill = guide_legend(reverse = TRUE))
fox_words <- wes_script %>%
filter(movie =="Fantastic Mr. Fox") %>%
count(movie,word, sort = TRUE)
fox_words
## movie word n
## 1 Fantastic Mr. Fox fox 488
## 2 Fantastic Mr. Fox kylie 145
## 3 Fantastic Mr. Fox ash 119
## 4 Fantastic Mr. Fox bean 79
## 5 Fantastic Mr. Fox kristofferson 73
## 6 Fantastic Mr. Fox badger 64
## [ reached 'max' / getOption("max.print") -- omitted 3146 rows ]
In our third script, the most frequent words are again character names and, moreover, the species of the protagonists: the fox. After these words we find more earthy concepts than in the previous scripts: hole, rat, paw, farmers, apple, chicken, air, tree, cider, mountain, and cuss. The protagonists of this movie are animals, not humans, and because they are “anthropomorphic animals” we see a mix of human and animal/nature concepts in the script.
But what does cuss mean?
fox %>%
filter(str_detect(text, "cuss"))
## text movie
## 1 Oh, cuss. What time is it? I'm sorry. Fantastic Mr. Fox
## 2 stinks like cuss, plus moving into the Fantastic Mr. Fox
## 3 Bull-cuss! I'm sugar-coating it, man! Fantastic Mr. Fox
## 4 the biggest cusshole I've ever met in my Fantastic Mr. Fox
## 5 The cuss you are! Fantastic Mr. Fox
## 6 The cuss am I? Fantastic Mr. Fox
## 7 Don't cussing point at me! Fantastic Mr. Fox
## 8 Are you cussing with me? Fantastic Mr. Fox
## 9 Do I look like I'm cussing with you? Fantastic Mr. Fox
## 10 A few beagles, as we discussed, but we're Fantastic Mr. Fox
## [ reached 'max' / getOption("max.print") -- omitted 33 rows ]
Cuss is a great word to analyze. For non-native English speakers it does not make a lot of sense if we see the word by itself. We need the context of this word to understand what it means. When we read the whole sentence we realize that cuss is used for cursing, it seems like a family-friendly substitute for fuck or shit.
- Oh cuss. What time is it?
- Stinks like cuss
- Bull-cuss!
- The cuss you are!
- Are you cussing with me?
- Let’s kick some fox cuss!
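Note that a plain search for “cuss” also matches “discussed” (row 10 of the output above). A word boundary in the pattern is one way to avoid that. This is a small sketch, assuming the stringr package is installed; the four lines are copied from the output above and the vector names are only illustrative.

```r
library(stringr)

# Four lines taken from the str_detect() output shown above
toy_lines <- c("Oh, cuss. What time is it?",
               "Bull-cuss! I'm sugar-coating it, man!",
               "A few beagles, as we discussed, but we're",
               "Are you cussing with me?")

# Plain substring search: also matches "discussed"
loose <- str_detect(toy_lines, "cuss")

# "\\b" requires "cuss" to start a word, so "discussed" is excluded
# while "cussing" and "Bull-cuss" are still found
strict <- str_detect(toy_lines, "\\bcuss")

data.frame(line = toy_lines, loose = loose, strict = strict)
```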
wes_script %>%
filter(movie =="Fantastic Mr. Fox") %>%
count(movie, word, sort = TRUE) %>%
filter(n >= 40) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = word)) +
geom_col() +
labs(y = NULL) +
scale_fill_manual(values = wes_palette(21, name = "FantasticFox1", type = "continuous"), name = "") +
guides(fill = guide_legend(reverse = TRUE))
Comparing the scripts
Anderson’s scripts often feature highly specific and detailed descriptions of settings and characters, which help to create a strong sense of visual style and tone.
frequency <- wes_script %>%
mutate(word = str_extract(word, "[a-z']+")) %>%
#we count number of mentions of a word in each movie
count(movie, word) %>%
#we calculate proportion
group_by(movie) %>%
mutate(proportion = n / sum(n)) %>%
select(-n) %>%
#we reshape the dataframe
#pivot wider means: more columns, fewer rows
pivot_wider(names_from = movie, values_from = proportion) %>%
#pivot longer means: more rows, fewer columns
pivot_longer(`Moonrise Kingdom`:`Fantastic Mr. Fox`,
names_to = "movie", values_to = "proportion")
frequency
## # A tibble: 12,234 × 4
## word Rushmore movie proportion
## <chr> <dbl> <chr> <dbl>
## 1 a NA Moonrise Kingdom 0.0000952
## 2 a NA Fantastic Mr. Fox 0.000563
## 3 abruptly NA Moonrise Kingdom 0.000190
## 4 abruptly NA Fantastic Mr. Fox 0.000225
## 5 accent 0.000505 Moonrise Kingdom NA
## 6 accent 0.000505 Fantastic Mr. Fox 0.000225
## 7 accountant NA Moonrise Kingdom NA
## 8 accountant NA Fantastic Mr. Fox 0.000113
## 9 achilles NA Moonrise Kingdom NA
## 10 achilles NA Fantastic Mr. Fox 0.000113
## # … with 12,224 more rows
There is a new column named “Rushmore” with proportions of the movie because it is our reference. We are comparing the words used in Moonrise Kingdom and Fantastic Mr. Fox with the ones used in Rushmore.
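The proportion and reshaping steps can be followed more easily on a tiny table. This is a minimal sketch, assuming dplyr and tidyr are installed; the movies “A” and “B” and their counts are invented.

```r
library(dplyr)
library(tidyr)

# Invented word counts for two hypothetical movies
toy_counts <- tibble(
  movie = c("A", "A", "B", "B"),
  word  = c("tree", "fox", "tree", "school"),
  n     = c(3, 1, 2, 2)
)

toy_freq <- toy_counts %>%
  group_by(movie) %>%
  # share of each word within its own movie
  mutate(proportion = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  # one column per movie; words missing from a movie become NA
  pivot_wider(names_from = movie, values_from = proportion)

toy_freq
```

Words that appear in only one movie end up with NA in the other column, which is where the NA values in the frequency table above come from.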
ggplot(frequency, aes(x = proportion, y = `Rushmore`,
color = abs(`Rushmore` - proportion))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 0.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 0.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001),
low = wes_palette("Rushmore", n = 20, type = "continuous")[1],
high = wes_palette("Rushmore", n = 20, type = "continuous")[20]) +
facet_wrap(~movie, ncol = 2) +
theme(legend.position="none") +
labs(y = "Rushmore", x = NULL)
## Warning: Removed 10072 rows containing missing values (`geom_point()`).
## Warning: Removed 10074 rows containing missing values (`geom_text()`).
The plot helps us understand which words the different scripts have in common. Words that are close to the line have similar frequencies in both sets of texts. The further they are toward the upper right of the plot, the more frequently they appear in both scripts. For example, Fantastic Mr. Fox and Rushmore share words like door, head, and black in the high-frequency area. We can glean a little more information from the comparison with Moonrise Kingdom by finding more explicit words like cigarette, bloody, and died. The words in the lower left part of the plot are in the low-frequency area. Lastly, words that are far from the line are found more in one set of texts than in the other. For instance, it is interesting to observe that the word kid and its plural are used much more in Rushmore than in Moonrise Kingdom, even though the main characters of the latter are younger than the protagonist of Rushmore. This is noteworthy because Rushmore is a coming-of-age movie in which Max Fischer, the protagonist, is an eccentric 15-year-old who is constantly interacting with older people: he falls in love with a first-grade teacher, Miss Cross, and befriends a wealthy industrialist, Herman Blume.
Word correlation
The Pearson correlation allows us to quantify how similar or different these sets of word frequencies are. This correlation returns a number between -1 and 1 that measures the strength and direction of the linear relationship between two variables.
cor.test(data = frequency[frequency$movie == "Fantastic Mr. Fox",],
~ proportion + `Rushmore`)
##
## Pearson's product-moment correlation
##
## data: proportion and Rushmore
## t = 13.498, df = 1035, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3338913 0.4374762
## sample estimates:
## cor
## 0.3869036
cor.test(data = frequency[frequency$movie == "Moonrise Kingdom",],
~ proportion + `Rushmore`)
##
## Pearson's product-moment correlation
##
## data: proportion and Rushmore
## t = 12.1, df = 1123, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2868543 0.3903059
## sample estimates:
## cor
## 0.3396068
Both Moonrise Kingdom and Fantastic Mr. Fox are weakly correlated with Rushmore, with Fantastic Mr. Fox (0.387) slightly more correlated with Rushmore than Moonrise Kingdom (0.340).
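For readers who want to see what the coefficient reported by cor.test() measures, the sketch below computes Pearson’s r by hand on two invented vectors of word proportions and compares it with base R’s cor(). The numbers are made up for illustration and are not taken from the scripts.

```r
# Invented proportions for five words in two hypothetical texts
x <- c(0.010, 0.004, 0.002, 0.001, 0.003)
y <- c(0.008, 0.005, 0.001, 0.002, 0.003)

# Pearson's r: co-variation of x and y divided by the product of their spreads
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The same quantity from base R
r_builtin <- cor(x, y, method = "pearson")

c(manual = r_manual, builtin = r_builtin)
```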
Sentiment analysis
Sentiment analysis is a natural language processing technique to determine the emotional tone of the content. In this case, the goal is to determine whether the scripts are expressing a positive, negative, or neutral sentiment.
Using sentiment analysis on Wes Anderson’s scripts could reveal information about the emotional tone of his films. Wes Anderson’s films are noted for their quirky and enchanting style, and sentiment analysis could contribute to uncovering the underlying emotional themes that are recurrent in his films.
Joy words
Let’s find out which are the most common words of joy in Rushmore, Moonrise Kingdom, and Fantastic Mr. Fox.
#we set nrc lexicon to a variable, filtering by joy
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
rushmore_joy <- wes_script %>%
filter(movie =="Rushmore") %>%
#we combine both lists, NRC and Rushmore's words
inner_join(nrc_joy) %>%
#we count the mentions of each word to find the most frequent
count(word, sort = TRUE)
## Joining with `by = join_by(word)`
rushmore_joy
## word n
## 1 love 16
## 2 friend 15
## 3 tree 10
## 4 mother 8
## 5 applause 7
## 6 dance 7
## 7 smiling 6
## 8 beautiful 5
## 9 food 5
## 10 football 5
## [ reached 'max' / getOption("max.print") -- omitted 72 rows ]
rushmore_joy %>%
filter(n > 4) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = n, y = word, fill=word)) +
geom_col() +
labs(y = NULL) +
scale_fill_manual(values = wes_palette(12, name = "Rushmore", type = "continuous"), name = "") +
guides(fill = guide_legend(reverse = TRUE))
moonrise_joy <- wes_script %>%
filter(movie =="Moonrise Kingdom") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining with `by = join_by(word)`
moonrise_joy
## word n
## 1 tree 15
## 2 kitten 13
## 3 beach 11
## 4 church 11
## 5 finally 10
## 6 music 8
## 7 love 7
## 8 true 7
## 9 cove 6
## 10 jump 6
## [ reached 'max' / getOption("max.print") -- omitted 74 rows ]
moonrise_joy %>%
filter(n > 5) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = n, y = word, fill=word)) +
geom_col() +
labs(y = NULL) +
scale_fill_manual(values = wes_palette(10, name = "Moonrise3", type = "continuous"), name = "") +
guides(fill = guide_legend(reverse = TRUE))
fox_joy <- wes_script %>%
filter(movie =="Fantastic Mr. Fox") %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
## Joining with `by = join_by(word)`
fox_joy
## word n
## 1 tree 26
## 2 toast 7
## 3 jump 6
## 4 beautiful 5
## 5 love 5
## 6 alive 4
## 7 dance 4
## 8 ecstatic 4
## 9 felicity 4
## 10 food 4
## [ reached 'max' / getOption("max.print") -- omitted 84 rows ]
fox_joy %>%
filter(n > 3) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = n, y = word, fill=word)) +
geom_col() +
labs(y = NULL) +
scale_fill_manual(values = wes_palette(12, name = "FantasticFox1", type = "continuous"), name = "") +
guides(fill = guide_legend(reverse = TRUE))
The plots show us that one of the most frequent and recurrent words of joy is tree. Trees appear frequently in Wes Anderson’s movies, serving as a powerful metaphor for nature and growth:
- Rushmore → Trees are part of the scenery; many sequences in the film show Max and other characters walking along tree-lined paths on the Rushmore Academy campus. Trees tend to be used to represent the passage of time and the growth of the characters.
- Moonrise Kingdom → Not only is the movie’s setting, a fictional island off the coast of New England, heavily wooded, but, as befits good scouts, the idea of tree houses also appears in the script.
- Fantastic Mr. Fox → Trees play a really important part in this film, as the family of foxes lives inside one. The trees provide protection for Mr. Fox as he outwits his human rivals and snatches food from their farms throughout the movie.
Other frequent joy words we can observe through these movies are love (recurrent theme in all three movies), dance and jump (actions that we see the characters in these movies frequently doing).
Comparing the sentiments between films
wes_sentiment <- wes_script %>%
#find the sentiment for each word using bing
#Bing is a binary lexicon with just 2 categories: positive and negative
inner_join(get_sentiments("bing")) %>%
#divide the matched words in chunks of 10 rows
count(movie, index = row_number() %/% 10, sentiment) %>%
#we write positive and negative in different columns
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
#we subtract negative from positive to find a net sentiment
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
wes_sentiment
## # A tibble: 288 × 5
## movie index negative positive sentiment
## <chr> <dbl> <int> <int> <int>
## 1 Fantastic Mr. Fox 112 4 3 -1
## 2 Fantastic Mr. Fox 113 5 5 0
## 3 Fantastic Mr. Fox 114 5 5 0
## 4 Fantastic Mr. Fox 115 6 4 -2
## 5 Fantastic Mr. Fox 116 7 3 -4
## 6 Fantastic Mr. Fox 117 6 4 -2
## 7 Fantastic Mr. Fox 118 7 3 -4
## 8 Fantastic Mr. Fox 119 5 5 0
## 9 Fantastic Mr. Fox 120 8 2 -6
## 10 Fantastic Mr. Fox 121 7 3 -4
## # … with 278 more rows
This information is easier to visualize and understand if we plot it:
#create the plot with x = index (chunks) and y = net sentiment
ggplot(wes_sentiment, aes(index, sentiment, fill = movie)) +
geom_col(show.legend = TRUE) +
facet_wrap(~movie, ncol = 2, scales = "free_x") +
scale_fill_manual(values = wes_palette(3, name = "Rushmore", type = "continuous"), name = "") +
guides(fill = guide_legend(reverse = TRUE))
This plot is very interesting because, despite Anderson’s aesthetics, the feelings we see in his films are predominantly negative (though I would rather say nostalgic):
Rushmore → Max is a smart but troubled teenager who struggles with his feelings for his teacher and his place in the world. Loss and rejection are key themes in the film.
Moonrise Kingdom → A nostalgic contemplation of young love and childhood innocence. But the film also addresses issues of grief and familial dysfunction, particularly in Suzy’s troubled relationship with her parents. The general tone of the picture is bittersweet, as it depicts a yearning for the purity and innocence of youth.
Fantastic Mr. Fox → The protagonist of this film is an aging and regretful wild animal who yearns for the days when he could steal from farmers. Despite being a lighthearted and fun movie, it deals with loss, rebellion, and the weight of consequences.
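The chunking and net-sentiment calculation used for the plot above can be sketched on toy data. The two-word lexicon and the token stream below are hypothetical and only serve to show how row_number() %/% 10 assigns rows to chunks; dplyr and tidyr are assumed to be installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-word lexicon and a toy stream of 20 tokens
toy_lexicon <- tibble(word = c("love", "lost"),
                      sentiment = c("positive", "negative"))
toy_tokens <- tibble(word = c(rep("love", 6), rep("lost", 14)))

toy_sentiment <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  # integer division: rows 1-9 fall in chunk 0, rows 10-19 in chunk 1, row 20 in chunk 2
  count(index = row_number() %/% 10, sentiment) %>%
  # one column per sentiment, filling absent combinations with 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # net sentiment per chunk
  mutate(sentiment = positive - negative)

toy_sentiment
```

Because the index is computed after the join, each chunk groups matched lexicon words, and since row numbers start at 1 the first chunk holds nine rows instead of ten.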
Comparing lexicon in Moonrise Kingdom
Since Moonrise Kingdom is the movie with the greatest variation in sentiment, let’s compare the lexicons to see the bias of each.
afinn <- wes_script %>%
filter(movie =="Moonrise Kingdom") %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = row_number() %/% 10) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
#Bing
wes_script %>%
filter(movie =="Moonrise Kingdom") %>%
#we get sentiments from bing
inner_join(get_sentiments("bing")) %>%
#we create the column for bing
mutate(method = "Bing et al."),
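#NRC
#note: unlike the AFINN and Bing branches, this branch has no movie filter, so it uses all three scripts
wes_script %>%
#we get sentiment from nrc
inner_join(get_sentiments("nrc") %>%
#we keep just the positive/negative sentiments, not the emotions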
filter(sentiment %in% c("positive",
"negative"))
) %>%
#we create the column for nrc
mutate(method = "NRC")) %>%
#we divide the matched words in chunks of 10 rows
count(method, index = row_number() %/% 10, sentiment) %>%
#we write positive and negative in different columns
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
#we extract net sentiment by subtraction
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 3955 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
#Bind the three of them
bind_rows(afinn,
bing_and_nrc) %>%
#make the plot with x=index (chunks), y=sentiment and fill by lexicon (method)
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
Now we can observe how different the narrative arc can look depending on the sentiment lexicon we use.
- AFINN tells us that the script has predominantly negative words, with a lot of low values and very few positive values.
- Bing et al. shows more variety and a positive ending after two blocks of negative narrative. It has some of the same negative points as AFINN but interprets many more events as positive.
- NRC has the largest blocks. The beginning is the opposite of what the other two sentiment lexicons tell us. The end of the script matches Bing et al. slightly.
This graph is a great example of why we can’t blindly trust lexicons: each one has a particular type of bias. In class we learned that there are three possible biases:
- The language is biased toward negative emotions (more negative than positive words).
- The language is biased toward pleasant emotions (more positive than negative words).
- The lexicon is stylistically and contextually distinct from the text under consideration.
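One quick way to check the first two kinds of bias is to count how many words a lexicon assigns to each category. The sketch below does this for the Bing lexicon, which get_sentiments("bing") returns as a two-column table of word and sentiment; dplyr and tidytext are assumed to be installed, and the name bing_balance is only illustrative.

```r
library(dplyr)
library(tidytext)

# Number of words the Bing lexicon labels as negative and as positive
bing_balance <- get_sentiments("bing") %>%
  count(sentiment)

bing_balance
```

The same count can be run on the other lexicons, although AFINN stores a numeric value instead of a sentiment label, so it would need to be grouped by the sign of that value.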
Word cloud
A popular technique for visualizing the most frequently occurring words in a piece of text is the word cloud. It is a quick and easy way to summarize the main themes and topics in a text, allowing us to rapidly identify the most significant ideas and concepts. So let’s look in a different way at the recurrent words we have been discussing.
colors_rushmore <- wes_palette("Rushmore", type = "continuous")
#colors_rushmore <- rev(colors_rushmore) #Reverse the colors to avoid excessive beige
wes_script %>%
filter(movie =="Rushmore") %>%
#Filter stopwords
anti_join(stop_words) %>%
#Count words
count(word) %>%
#Use the wordcloud function and add colors as an argument
with(wordcloud(word, n, max.words = 90, colors = colors_rushmore))
## Joining with `by = join_by(word)`
colors_moonrise <- wes_palette("Moonrise3", type = "continuous")
wes_script %>%
filter(movie =="Moonrise Kingdom") %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 90, colors = colors_moonrise))
## Joining with `by = join_by(word)`
colors_fox <- wes_palette("FantasticFox1", type = "continuous")
wes_script %>%
filter(movie =="Fantastic Mr. Fox") %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 90, colors = colors_fox))
## Joining with `by = join_by(word)`
After making a word cloud for each of the movies, we can also make a comparison word cloud of the positive and negative words in the three scripts.
#Add as stop words some words that, as we already saw, mostly refer to character names
custome_stop_words <- bind_rows(tibble(word = c("miss", "master", "sharp"),
lexicon = c("custom")),
stop_words)
wes_script %>%
anti_join(custome_stop_words) %>%
#Get sentiments
inner_join(get_sentiments("bing")) %>%
#Count word mentions
count(word, sentiment, sort = TRUE) %>%
#Establish criteria for size
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
#Paint two wordclouds in one using two different colors
comparison.cloud(colors = c("#9e1906", "#89b151"),
max.words = 100)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
bing_word_counts <- wes_script %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining with `by = join_by(word)`
bing_word_counts
## word sentiment n
## 1 master positive 161
## 2 sharp positive 140
## 3 miss negative 128
## 4 smiles positive 70
## 5 top positive 50
## 6 cuss negative 31
## [ reached 'max' / getOption("max.print") -- omitted 815 rows ]
bing_word_counts %>%
anti_join(custome_stop_words) %>%
group_by(sentiment) %>%
slice_max(n, n = 15) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL) +
scale_fill_manual(values = wes_palette(3, name = "Rushmore", type = "continuous"),
name = "") +
guides(fill = guide_legend(reverse = TRUE))
## Joining with `by = join_by(word)`
This plot gives a side-by-side comparison of the positive and negative words across the three scripts, without the stop words and the additional ones that refer to character names. It is really nice to see cuss in the negative column, as the second most used negative word after slowly.
Term frequency
So far, we have seen the word frequency of each script (how many times a word appears in a text). However, there are more sophisticated methods that allow us to better understand Anderson’s narrative style:
- Term frequency (TF) → How frequently a word appears in a document, taking the document’s length into account.
- Inverse document frequency (IDF) → Measures how rare a word is across the collection: the logarithm of the number of documents divided by the number of documents that contain the word. A word that appears in every document gets an IDF of zero.
- TF-IDF → The product of the two previous measures, which captures how important a word is to one document within a collection.
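As a worked illustration of these three definitions, the sketch below computes them by hand in base R. The three documents and their counts are made up for the example (they are not the real scripts), and the natural logarithm is assumed for IDF, which is consistent with the value 1.098612 (that is, ln 3) reported for words unique to one of three scripts in the tf-idf output further on.

```r
# Hypothetical word counts for three toy documents.
counts <- data.frame(
  doc  = c("A", "A", "B", "B", "C", "C"),
  word = c("fox", "tree", "max", "tree", "sam", "tree"),
  n    = c(4, 6, 3, 7, 5, 5),
  stringsAsFactors = FALSE
)

# TF: count of the word divided by the total number of words in its document.
counts$tf <- counts$n / ave(counts$n, counts$doc, FUN = sum)

# IDF: natural log of (number of documents / documents containing the word).
n_docs <- length(unique(counts$doc))
docs_with_word <- as.numeric(table(counts$word)[counts$word])
counts$idf <- log(n_docs / docs_with_word)

# TF-IDF: the product of the two.
counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

In this toy collection, tree appears in all three documents, so its IDF and TF-IDF are zero no matter how often it is used, while fox, max and sam each get an IDF of ln 3.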
We previously established how many times each word appeared in each movie, so now we have to sum how many words there are in total in each of the three scripts.
total_words <- wes_words %>%
#Group by movie to sum all the totals in the n column of wes_words
group_by(movie) %>%
#Create a column called total with the total of words by movie
summarize(total = sum(n))
total_words
## # A tibble: 3 × 2
## movie total
## <chr> <int>
## 1 Fantastic Mr. Fox 8888
## 2 Moonrise Kingdom 10502
## 3 Rushmore 7927
Then we add this information (the total number of words) to the dataframe with the function left_join().
wes_words <- left_join(wes_words, total_words)
## Joining with `by = join_by(movie)`
wes_words <- wes_words %>%
#Add a column for term_frequency in each movie
mutate(term_frequency = n/total)
wes_words
## movie word n total term_frequency
## 1 Fantastic Mr. Fox fox 488 8888 0.05490549
## 2 Rushmore max 487 7927 0.06143560
## 3 Moonrise Kingdom sam 285 10502 0.02713769
## 4 Moonrise Kingdom suzy 238 10502 0.02266235
## [ reached 'max' / getOption("max.print") -- omitted 9102 rows ]
Now we can visualize the information in a plot:
ggplot(wes_words, aes(term_frequency, fill = movie)) +
#Create the bars histogram
geom_histogram(show.legend = TRUE) +
#Set the limit for the term frequency in the x axis
xlim(NA, 0.0009) +
#The Life Aquatic with Steve Zissou is another Anderson's film
scale_fill_manual(values = wes_palette(3, name = "Zissou1",
type = "continuous"), name = "") +
guides(fill = guide_legend(reverse = TRUE))
- **X axis**: the term frequencies of words in the collection of scripts.
- **Y axis**: how many words in each movie have each frequency.
ggplot(wes_words, aes(term_frequency, fill = movie)) +
geom_histogram(show.legend = TRUE) +
xlim(NA, 0.0009) +
facet_wrap(~movie, ncol = 2, scales = "free_y") +
scale_fill_manual(values = wes_palette(3, name = "Zissou1",
type = "continuous"), name = "") +
guides(fill = guide_legend(reverse = TRUE))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 486 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 3 rows containing missing values (`geom_bar()`).
The three scripts follow a similar, positively skewed distribution. Most words have very low frequencies (on the left) and a few words have comparatively high frequencies (on the right). Rushmore seems to have the largest number of words in the rightmost category, while Moonrise Kingdom has the longest tail.
Zipf’s Law
Zipf’s Law is an empirical law that asserts that the frequency of each word in a large sample of texts is inversely proportional to its rank in the frequency table. To test it, we add a column for the rank to the dataframe, ranking the words in descending order by their frequency in each movie script.
freq_by_rank <- wes_words %>%
group_by(movie) %>%
#Create the column for the rank with row_number by movie
mutate(rank = row_number()) %>%
ungroup()
freq_by_rank
## # A tibble: 9,106 × 6
## movie word n total term_frequency rank
## <chr> <chr> <int> <int> <dbl> <int>
## 1 Fantastic Mr. Fox fox 488 8888 0.0549 1
## 2 Rushmore max 487 7927 0.0614 1
## 3 Moonrise Kingdom sam 285 10502 0.0271 1
## 4 Moonrise Kingdom suzy 238 10502 0.0227 2
## 5 Rushmore blume 215 7927 0.0271 2
## 6 Moonrise Kingdom scout 193 10502 0.0184 3
## 7 Moonrise Kingdom master 157 10502 0.0149 4
## 8 Moonrise Kingdom captain 146 10502 0.0139 5
## 9 Fantastic Mr. Fox kylie 145 8888 0.0163 2
## 10 Moonrise Kingdom ward 141 10502 0.0134 6
## # … with 9,096 more rows
freq_by_rank %>%
ggplot(aes(rank, term_frequency, color = movie)) +
geom_line(linewidth = 1.1, alpha = 0.8, show.legend = TRUE) +
scale_color_manual(values = wes_palette(3, name = "Moonrise3"), name = "")
freq_by_rank %>%
ggplot(aes(rank,term_frequency, color = movie)) +
geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
#better to visualize it on logarithmic scales
scale_x_log10() +
scale_y_log10() +
scale_color_manual(values = wes_palette(3, name = "Moonrise3"), name = "")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
On logarithmic scales, an inversely proportional relationship shows up as a line with a roughly constant negative slope, running from the upper left to the lower right.
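As a reference point, the following base R sketch builds an idealised Zipf distribution, where frequency is exactly proportional to 1/rank, and confirms that its log-log slope is -1. The vocabulary size of 1000 is an arbitrary assumption chosen for the example.

```r
# Idealised Zipf distribution: frequency proportional to 1 / rank.
rank <- 1:1000
term_frequency <- (1 / rank) / sum(1 / rank)

# On log-log scales the relationship is a straight line with slope -1.
zipf_fit <- lm(log10(term_frequency) ~ log10(rank))
zipf_slope <- unname(coef(zipf_fit)[2])
print(zipf_slope)
```

A real corpus rarely matches this exactly, so the interesting part of the plots is how far, and at which ranks, each script bends away from a straight line.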
Measuring deviation
When we divide this plot into three pieces, we can see that the center to the right is the most stable. To calculate the specific coefficients of the relation between term frequency and rank we use a linear regression:
rank_subset <- freq_by_rank %>%
filter(rank < 500,
rank > 10)
#Use the linear model function to find numeric coefficients of relationship between TF and rank
lm(log10(term_frequency) ~ log10(rank), data = rank_subset)
##
## Call:
## lm(formula = log10(term_frequency) ~ log10(rank), data = rank_subset)
##
## Coefficients:
## (Intercept) log10(rank)
## -1.5727 -0.6704
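Rather than retyping the printed numbers, the coefficients can be read from the model object with `coef()`. The sketch below uses toy data (`toy_ranks`, generated from a made-up power law with intercept -1.5 and slope -0.7) instead of `rank_subset`, so the names and values are assumptions for illustration only.

```r
# Toy data following term_frequency = 10^(-1.5) * rank^(-0.7) exactly.
toy_ranks <- data.frame(rank = 11:499)
toy_ranks$term_frequency <- 10^(-1.5) * toy_ranks$rank^(-0.7)

fit <- lm(log10(term_frequency) ~ log10(rank), data = toy_ranks)

# Pull the coefficients out by name instead of copying them by hand.
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["log10(rank)"]]

# The fitted line corresponds to the power law tf = 10^intercept * rank^slope.
predict_tf <- function(rank) 10^intercept * rank^slope
print(c(intercept = intercept, slope = slope, tf_at_rank_100 = predict_tf(100)))
```

Values obtained this way can be passed straight to `geom_abline(intercept = intercept, slope = slope)`, which avoids a mismatch between the fitted model and the line that is drawn.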
We use these coefficients to draw a line in the plot and observe the deviation from the standard use of language. This is a great way to visualize Anderson’s deviation in the use of language, especially in the first section (upper left part).
freq_by_rank %>%
ggplot(aes(rank, term_frequency, color = movie)) +
#Add a line to the plot using the two coefficients found by lm() above
geom_abline(intercept = -1.5727, slope = -0.6704,
color = "#fe6d7e", linetype = 2) +
geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10() +
scale_color_manual(values = wes_palette(3, name = "Moonrise3"), name = "") +
guides(fill = guide_legend(reverse = TRUE))
TF-IDF
wes_tf_idf <- wes_words %>%
bind_tf_idf(word, movie, n) %>%
# find the words most distinctive to each document
arrange(desc(tf_idf))
tail(wes_tf_idf, n=20)
## movie word n total term_frequency tf idf tf_idf
## 9087 Rushmore version 1 7927 0.0001261511 0.0001261511 0 0
## 9088 Rushmore view 1 7927 0.0001261511 0.0001261511 0 0
## [ reached 'max' / getOption("max.print") -- omitted 18 rows ]
The first words in the dataframe have a very high TF-IDF because they are the most distinctive words of each script (TF-IDF rewards words that are frequent in one script but rare in the others). The ones at the bottom, with a TF-IDF of zero, are words that occur in all the texts of this script collection (e.g., wood, version, weather, worry, worse, wounded).
To check the most distinctive words of each script, we just have to arrange the dataframe by TF-IDF:
wes_tf_idf %>%
#Exclude the total column which is not necessary now
select(-total) %>%
#Arrange by tf-idf in descending order
arrange(desc(tf_idf))
## movie word n term_frequency tf idf tf_idf
## 1 Rushmore max 487 0.06143560 0.06143560 1.098612 0.06749390
## 2 Fantastic Mr. Fox fox 488 0.05490549 0.05490549 1.098612 0.06031985
## [ reached 'max' / getOption("max.print") -- omitted 9104 rows ]
What we see is that the most distinctive words of each script are the character names, because those names are unique to each story.
Since the names of the characters are so distinctive of each script, I have decided to treat them as stop words to see what other words characterize each movie.
wes_stop_words <- tibble(word = c("kylie", "ash", "kristofferson", "badger", "boggis", "bunce",
"sam", "suzy", "skotak", "bishop", "master", "captain","ward",
"max", "blume", "miss", "fischer", "margaret", "guggenheim",
"magnus", "rushmore", "dirk", "fox", "scout"))
wes_tf_idf %>%
anti_join(wes_stop_words) %>%
group_by(movie) %>%
#choose maximum number of words
slice_max(tf_idf, n = 10) %>%
ungroup() %>%
ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = movie)) +
geom_col(show.legend = FALSE) +
facet_wrap(~movie, ncol = 2, scales = "free") +
labs(x = "tf-idf", y = NULL) +
scale_fill_manual(values = wes_palette(3, name = "Moonrise3"), name = "") +
guides(fill = guide_legend(reverse = TRUE))
## Joining with `by = join_by(word)`
What measuring TF-IDF shows us here is that Wes Anderson uses very different language across these three scripts. Another way to read this graphic is to appreciate the distinct and unique worlds Wes Anderson creates. Anderson’s scripts differ from one another not only in their proper nouns or character names, but also in their atmosphere, storyline, and surroundings. We have the story of an anthropomorphic fox, with human-like qualities, who steals food from three wealthy farmers; the story of two young lovers who run away from home and the community that comes together to search for them; and the story of a precocious high school student who is struggling to find his place in the world.
N-grams
If we want to deal with something more than individual units, we turn to n-grams. They are consecutive sequences of words, where ‘n’ denotes the number of words that make up a token.
Bigrams are tokens composed of two words:
wes_bigrams <- script %>%
#Tokenize Anderson's scripts into sequences of 2 words
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
#Filter all N/A outputs
filter(!is.na(bigram))
wes_bigrams
## movie bigram
## 1 Moonrise Kingdom a landing
## 2 Moonrise Kingdom landing at
## 3 Moonrise Kingdom at the
## 4 Moonrise Kingdom the top
## 5 Moonrise Kingdom top of
## 6 Moonrise Kingdom of a
## 7 Moonrise Kingdom a crooked
## 8 Moonrise Kingdom crooked wooden
## 9 Moonrise Kingdom wooden staircase
## 10 Moonrise Kingdom staircase there
## [ reached 'max' / getOption("max.print") -- omitted 55955 rows ]
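To show what the tokenizer is doing conceptually, here is a minimal base R sketch that builds bigrams by pairing each word with the one that follows it. The input line is reconstructed from the first bigrams printed above; real tokenization with unnest_tokens() also handles punctuation and case, which this sketch only approximates with tolower() and a whitespace split.

```r
# Pair each word with its successor to form overlapping bigrams.
line <- "a landing at the top of a crooked wooden staircase"
tokens <- strsplit(tolower(line), "\\s+")[[1]]

# head(tokens, -1) drops the last word, tail(tokens, -1) drops the first.
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
print(bigrams)
```

A text with k words yields k - 1 bigrams, and every word except the first and last belongs to two of them, which is why the bigram table has almost as many rows as the script has words.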
The most frequent bigrams in Anderson’s scripts are:
wes_bigrams %>%
count(bigram, sort = TRUE)
## bigram n
## 1 of the 345
## 2 in the 320
## 3 on the 228
## 4 mr blume 193
## 5 to the 172
## 6 in a 167
## 7 at the 157
## 8 with a 153
## 9 scout master 146
## 10 out of 137
## [ reached 'max' / getOption("max.print") -- omitted 30481 rows ]
library(tidyr)
bigrams_separated <- wes_bigrams %>%
#Separate each bigram in two columns, word1 and word2
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_separated
## movie word1 word2
## 1 Moonrise Kingdom a landing
## 2 Moonrise Kingdom landing at
## 3 Moonrise Kingdom at the
## 4 Moonrise Kingdom the top
## 5 Moonrise Kingdom top of
## 6 Moonrise Kingdom of a
## [ reached 'max' / getOption("max.print") -- omitted 55959 rows ]
#Filter all words included in the word column in stop_words
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
bigrams_filtered
## movie word1 word2
## 1 Moonrise Kingdom crooked wooden
## 2 Moonrise Kingdom wooden staircase
## 3 Moonrise Kingdom threadbare braided
## 4 Moonrise Kingdom braided rug
## 5 Moonrise Kingdom corridor decorated
## 6 Moonrise Kingdom faded paintings
## [ reached 'max' / getOption("max.print") -- omitted 9246 rows ]
New bigram count:
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_counts
## word1 word2 n
## 1 scout master 146
## 2 master ward 134
## 3 captain sharp 130
## 4 miss cross 122
## 5 cousin ben 33
## 6 dr guggenheim 31
## [ reached 'max' / getOption("max.print") -- omitted 6930 rows ]
The function unite() is used to reunite the previously separated bigrams in a single column:
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
bigrams_united
## movie bigram
## 1 Moonrise Kingdom crooked wooden
## 2 Moonrise Kingdom wooden staircase
## 3 Moonrise Kingdom threadbare braided
## 4 Moonrise Kingdom braided rug
## 5 Moonrise Kingdom corridor decorated
## 6 Moonrise Kingdom faded paintings
## 7 Moonrise Kingdom sun bleached
## 8 Moonrise Kingdom newly hung
## 9 Moonrise Kingdom hung strips
## 10 Moonrise Kingdom easel sits
## [ reached 'max' / getOption("max.print") -- omitted 9242 rows ]
Combine with TF-IDF
bigram_tf_idf <- bigrams_united %>%
#Count by movie
count(movie, bigram) %>%
#Perform tf_idf
bind_tf_idf(bigram, movie, n) %>%
#Arrange in descending order
arrange(desc(tf_idf))
bigram_tf_idf
## movie bigram n tf idf tf_idf
## 1 Rushmore miss cross 122 0.04915391 1.098612 0.05400109
## 2 Moonrise Kingdom scout master 146 0.03872679 1.098612 0.04254573
## 3 Moonrise Kingdom master ward 134 0.03554377 1.098612 0.03904882
## [ reached 'max' / getOption("max.print") -- omitted 7047 rows ]
bigram_tf_idf %>%
group_by(movie) %>%
#Maximum number of words set at 10
slice_max(tf_idf, n = 10) %>%
ungroup() %>%
ggplot(aes(tf_idf, fct_reorder(bigram, tf_idf), fill = movie)) +
geom_col(show.legend = FALSE) +
facet_wrap(~movie, ncol = 2, scales = "free") +
labs(x = "tf-idf", y = NULL) +
#To show another of the palettes in the wesanderson library, this graph uses the French Dispatch palette (Wes Anderson's most recent movie at the time of writing)
scale_fill_manual(values = wes_palette(3, name = "FrenchDispatch"), name = "") +
guides(fill = guide_legend(reverse = TRUE))
Bigram function
To get a better understanding of bigrams in Anderson’s scripts, let’s use the functions we learned in class to count and visualize n-grams more easily. Note that they filter only the general stop_words, so character names remain in the graphs.
count_bigrams <- function(dataset) {
dataset %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word) %>%
count(word1, word2, sort = TRUE)
}
visualize_bigrams <- function(bigrams) {
set.seed(1969)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
bigrams %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE, arrow = a) +
geom_node_point(size = 3, alpha = 0.8) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
}
rushmore_bigram <- script %>%
filter(movie=="Rushmore") %>%
count_bigrams()
rushmore_bigram %>%
filter(n>3) %>%
drop_na() %>%
visualize_bigrams() +
geom_node_point(size = 5, color="#032b5f")
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
From the Rushmore bigram plot we can see a big cloud of points coming out of Max. These are Max’s actions throughout the script: Max holds, Max leaves, Max stares, Max waves, Max watches, Max nods, smiles sadly (after all, he is the protagonist of this story). They tell us a lot about Max’s contemplative character. There are some other really nice bigrams that remind us of the school setting where the plot takes place, like geometry test, yearbook photographer or lower school. In a different way, bigrams such as hand job also tell us about the interests, humor and rumors that drive the plot (e.g., “Did you say my mother gave you a hand job?”). There is even a short video compiling these moments.
moonrise_bigram <- script %>%
filter(movie=="Moonrise Kingdom") %>%
count_bigrams()
moonrise_bigram %>%
filter(n>3) %>%
drop_na() %>%
visualize_bigrams() +
geom_node_point(size = 5, color="#f4b5bd")
From the Moonrise Kingdom bigram plot, it is worth highlighting that the two nodes with the most neighbors are once again our protagonists: Suzy and Sam. In addition, we have some scout-related bigrams such as walkie talkie, command tent, khaki scouts, and camp ivanhoe.
fox_bigram <- script %>%
filter(movie=="Fantastic Mr. Fox") %>%
count_bigrams()
fox_bigram %>%
filter(n>3) %>%
drop_na() %>%
visualize_bigrams() +
geom_node_point(size = 5, color="#e3782f")
Lastly, in the Fantastic Mr. Fox bigram plot we observe how the node
with most connections is fox. This shows not only fox’s actions but also
a really big danger for the featured family: a fox trap. Some other
examples of interesting bigrams that let us know about the plot of the
story are: wild life, chicken house, double pneumonia, courtyard doors,
camera crew which is related to action 13 (Action 13 camera crew), and
fire truck.
Besides seeing which words go together, we can also analyze words that appear in the same context but not necessarily next to each other. The first step is to take Anderson’s scripts and split them into sections of 10 lines. This creates a dataframe with each tokenised word and a column that indicates the section of the scripts where the token appears.
In order to figure out which pairs of words appear together across
the scripts, in more than one section, we will use the function
pairwise_count() which gives one row for each pair of
words, and the number of times they co-appeared in the same section of
10 lines.
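The idea behind this kind of pairwise counting can be sketched in base R with an incidence matrix. The toy data frame below (`toy`, with three sections and four words) is hypothetical, and this is only an illustration of the counting logic, not a description of how widyr implements it.

```r
# Hypothetical tidy data: one row per word per section.
toy <- data.frame(
  section = c(1, 1, 1, 2, 2, 3, 3),
  word    = c("sam", "suzy", "scout", "sam", "suzy", "scout", "master"),
  stringsAsFactors = FALSE
)

# Section-by-word incidence matrix: 1 if the word occurs in the section.
incidence <- (unclass(table(toy$section, toy$word)) > 0) * 1

# t(incidence) %*% incidence counts, for each pair of words,
# the number of sections that contain both.
co_occurrence <- crossprod(incidence)
diag(co_occurrence) <- 0  # a word paired with itself is not of interest

print(co_occurrence["sam", "suzy"])
```

The resulting matrix is symmetric, which is why every pair shows up twice (once in each order) in the long-format output of a pairwise count.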
wes_section_words <- script %>%
mutate(section = row_number() %/% 10) %>%
filter(section > 0) %>%
unnest_tokens(word, text) %>%
filter(!word %in% stop_words$word)
wes_word_pairs <- wes_section_words %>%
pairwise_count(word, section, sort = TRUE)
wes_word_pairs
## # A tibble: 699,454 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 sam suzy 110
## 2 suzy sam 110
## 3 master scout 98
## 4 scout master 98
## 5 ward master 93
## 6 master ward 93
## 7 ward scout 92
## 8 scout ward 92
## 9 sharp captain 83
## 10 captain sharp 83
## # … with 699,444 more rows
What we observe in the results are mostly character names. This technique is convenient for finding out which characters appear in the same context. For example, Sam and Suzy are the main characters of Moonrise Kingdom and they seem to share a lot of time in the script.
Another thing we can do with this function is to filter by specific words. Previously we saw that the word tree is very present in the three movies, so let’s see what other words appear frequently in the context of tree.
wes_word_pairs %>%
filter(item1 == "tree")
## # A tibble: 995 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 tree fox 19
## 2 tree stands 10
## 3 tree house 7
## 4 tree door 6
## 5 tree suzy 6
## 6 tree trunk 6
## 7 tree silence 6
## 8 tree hole 6
## 9 tree pause 6
## 10 tree dark 6
## # … with 985 more rows
wes_section_words %>%
pairwise_count(word, section, sort = TRUE) %>%
filter(item1 == "tree") %>%
filter(n >= 6) %>%
mutate(item2 = reorder(item2, n)) %>%
ggplot(aes(n, item2, fill = item2)) +
geom_col() +
labs(y = NULL) +
#Castello Cavalcanti is a short film released in 2013. It is one of the films that has a range of colors most similar to the word we are filtering, tree.
scale_fill_manual(values = wes_palette(16 ,name = "Cavalcanti1", type = "continuous"), name = "") +
guides(fill = guide_legend(reverse = TRUE))
Pairwise correlation
We saw in class that the correlation between terms indicates how frequently they appear together compared to how frequently they appear independently.
The phi coefficient, which is equivalent to the Pearson correlation applied to binary data, will be used to measure this correlation. In a corpus, the phi coefficient measures how much more likely it is that two words both appear, or are both absent, in a section than that one appears without the other.
wes_word_cors <- wes_section_words %>%
group_by(word) %>%
filter(n() >= 20) %>%
pairwise_cor(word, section, sort = TRUE)
wes_word_cors
## # A tibble: 38,612 × 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 pierce commander 1
## 2 commander pierce 1
## 3 services social 0.979
## 4 social services 0.979
## 5 ben cousin 0.951
## 6 cousin ben 0.951
## 7 sharp captain 0.942
## 8 captain sharp 0.942
## 9 ward master 0.937
## 10 master ward 0.937
## # … with 38,602 more rows
Most of the correlations appear to be between characters. Commander Pierce has a correlation of 1 because it is the full name of a character, not two separate words. The perfect correlation indicates that, within these sections, pierce and commander always appear together: neither word occurs in a section without the other.
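The phi coefficient can be computed by hand from a 2 × 2 table of presence and absence. The sketch below uses two made-up indicator vectors over eight hypothetical sections and checks that the result equals the Pearson correlation of the same 0/1 vectors.

```r
# Presence (1) or absence (0) of two words in eight hypothetical sections.
word_x <- c(1, 1, 1, 0, 0, 1, 0, 0)
word_y <- c(1, 1, 0, 0, 0, 1, 1, 0)

# Cells of the 2 x 2 contingency table.
n11 <- sum(word_x == 1 & word_y == 1)  # sections with both words
n10 <- sum(word_x == 1 & word_y == 0)  # only word_x
n01 <- sum(word_x == 0 & word_y == 1)  # only word_y
n00 <- sum(word_x == 0 & word_y == 0)  # neither word

# Phi coefficient from the four cell counts.
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

print(c(phi = phi, pearson = cor(word_x, word_y)))
```

When two words occur in exactly the same sections, `n10` and `n01` are both zero and phi is 1, which is the situation described for pierce and commander.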
We can also filter the output of pairwise_cor() by particular words. For instance, it would be nice to know which words the protagonist of Rushmore, Max, is most correlated with.
wes_word_cors %>%
filter(item1 == "max")
## # A tibble: 196 × 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 max blume 0.547
## 2 max miss 0.530
## 3 max cross 0.461
## 4 max fischer 0.457
## 5 max dirk 0.451
## 6 max rushmore 0.400
## 7 max smiles 0.329
## 8 max guggenheim 0.299
## 9 max dr 0.294
## 10 max blume's 0.282
## # … with 186 more rows
In Rushmore, the relationship between Max, Blume, and Miss Cross is complicated, as Max builds a friendship with Blume while also falling in love with Miss Cross, resulting in a love triangle that strains their friendships and finally leads to personal growth for all three characters.
Using n-grams for context
We saw in class that negation makes sentiment analysis more difficult. A word keeps its meaning until the word not is placed in front of it. Thus, we need the context to know whether a word stands by itself or is preceded by a negation word.
bigrams_separated %>%
filter(word1 == "not") %>%
count(word1, word2, sort = TRUE)
## word1 word2 n
## 1 not going 13
## 2 not to 6
## 3 not a 5
## 4 not bad 5
## 5 not get 4
## 6 not so 4
## [ reached 'max' / getOption("max.print") -- omitted 85 rows ]
We could also do a similar analysis to see the context of positive, though sometimes bittersweet, words like love:
bigrams_separated %>%
filter(word1 == "love") %>%
count(word1, word2, sort = TRUE)
## word1 word2 n
## 1 love with 5
## 2 love you 5
## 3 love blueberries 2
## 4 love it 2
## 5 love to 2
## 6 love blume 1
## [ reached 'max' / getOption("max.print") -- omitted 9 rows ]
Next we create a dataframe with AFINN, which allows us to see the negated words in Anderson’s scripts together with the sentiment value each word carries by itself.
AFINN <- get_sentiments("afinn")
negation_words <- c("not", "no", "never", "without")
negated_words <- bigrams_separated %>%
filter(word1 %in% negation_words) %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word1, word2, value, sort = TRUE)
negated_words
## word1 word2 value n
## 1 not bad -3 5
## 2 no thanks 2 3
## 3 not fair 2 2
## 4 not sick -2 2
## 5 not true 2 2
## [ reached 'max' / getOption("max.print") -- omitted 19 rows ]
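To see how much these bigrams can distort a word-by-word score, the sketch below recomputes the contribution (value × number of mentions) for the five rows printed above. The sign flip at the end is a deliberately crude assumption about how one might correct for negation, not something the AFINN lexicon prescribes.

```r
# The five negated bigrams shown in the output above.
negated <- data.frame(
  word1 = c("not", "no", "not", "not", "not"),
  word2 = c("bad", "thanks", "fair", "sick", "true"),
  value = c(-3, 2, 2, -2, 2),
  n     = c(5, 3, 2, 2, 2),
  stringsAsFactors = FALSE
)

# Contribution of each word when the negation in front of it is ignored.
negated$contribution <- negated$n * negated$value

# Net effect when negation is ignored, and under a naive sign flip.
ignored <- sum(negated$contribution)
flipped <- -ignored
print(negated)
print(c(ignored = ignored, flipped = flipped))
```

For example, not bad alone contributes -15 to a naive AFINN total even though the phrase is mildly positive.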
negated_words %>%
mutate(contribution = n * value,
sign = if_else(value > 0, "positive", "negative")) %>%
group_by(word1) %>%
top_n(20, abs(contribution)) %>%
ungroup() %>%
ggplot(aes(y = reorder_within(word2, contribution, word1),
x = contribution,
fill = sign)) +
geom_col() +
scale_y_reordered() +
facet_wrap(~ word1, scales = "free") +
labs(y = 'Words preceded by a negation',
x = "Contribution (Sent value * number of mentions)",
title = "Most common pos or neg words to follow negations") +
scale_fill_manual(values = c("#9e1906", "#89b151"))
It is interesting to see an absence of positive words negated in
Anderson’s scripts.
Document-term matrix
The last section of this paper explores the function cast_dtm(), which turns a “tidy” one-term-per-document-per-row data frame into a DocumentTermMatrix:
wes_dtm <- script %>%
unnest_tokens(word, text) %>%
count(movie, word) %>%
#Convert to the matrix
cast_dtm(movie, word, n)
wes_dtm
## <<DocumentTermMatrix (documents: 3, terms: 6785)>>
## Non-/sparse entries: 10436/9919
## Sparsity : 49%
## Maximal term length: 22
## Weighting : term frequency (tf)
Note that the matrix has a sparsity of 49%, which means that only about half of its cells are zeros. The collection of scripts has low sparsity, which indicates that Anderson uses similar vocabulary across the movies, despite creating such unique and distinctive stories. Each story has its own identity, but we see a number of recurring terms in Anderson’s writing.
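The sparsity figure can be checked by hand from the numbers printed with the matrix: it is the share of document-term cells whose count is zero. The variable names below are chosen for this example only.

```r
# Figures copied from the DocumentTermMatrix summary above.
documents  <- 3
terms      <- 6785
non_sparse <- 10436  # cells with a count greater than zero
sparse     <- 9919   # cells equal to zero

# Every cell is either sparse or non-sparse.
total_cells <- documents * terms

# Sparsity: proportion of cells that are zero.
sparsity <- sparse / total_cells
print(round(100 * sparsity))
```

With only three documents, a term that appears in a single script leaves two zero cells, so a sparsity below 50% means a sizeable part of the vocabulary is shared by at least two of the scripts.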
Conclusion
To sum up, the text analysis of Wes Anderson’s scripts has allowed us to understand that behind a colorful and lively facade, complex and thoughtful stories are hidden.
Despite the differences in their settings and plots, all three scripts have one thing in common: they all explore the complexities and sadness of life, even when there are times of joy and fun. They all have Anderson’s distinct visual aesthetic, as well as a concentration on character-driven storylines that explore themes of family, identity, and the search for meaning in life.
Anderson’s stories are melancholy and nostalgic, despite their whimsical and often comic tone. He uses many more negative terms than positive ones. Despite their disparities in place and content, the three scripts share a unifying lexicon and writing style that is distinctly Anderson. Ultimately, the scripts for Rushmore, Moonrise Kingdom, and Fantastic Mr. Fox are excellent examples of Anderson’s distinct style as a screenwriter, demonstrating his ability to convey deeply personal yet broadly relatable stories.